08. Quiz: Split Sentences
In this exercise, you will read in some text from a file, split the text into sentences, and then split each sentence into words (tokens). Instead of using the built-in Python string method .split(), try using the Regular Expression package re.
Come up with an appropriate regular expression that matches sentence delimiters, and use it like this:
sentences = re.split(r"<your regexp>", text)
Note the 'r' preceding the regexp string: this denotes a raw string and tells Python not to process backslash escape sequences in the literal (e.g. '\n' is not converted to a newline), so backslashes reach the re module unchanged.
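To make the raw-string point concrete, here is a small sketch comparing a normal string literal with a raw one. The variable names and sample strings are made up for illustration and are not part of the quiz.

```python
import re

plain = "a\nb"   # "\n" is an escape sequence: a single newline character
raw = r"a\nb"    # raw string: the backslash and the "n" stay as two characters

print(len(plain))  # 3
print(len(raw))    # 4

# In a raw string, regex escapes such as \s are passed to re unchanged.
print(re.split(r"\s+", "one  two\tthree"))  # ['one', 'two', 'three']
```

Writing every pattern as a raw string is a common habit because it keeps regex escapes and Python string escapes from interfering with each other.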
Specifying word delimiters is also pretty easy. Refer to the re library documentation for details.
Remember to remove leading and trailing spaces. If that results in any empty strings, drop them from the list that is returned.
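The split, strip, and drop-empties steps can be sketched on a simpler delimiter set than sentences. The helper name split_and_clean and the sample inputs below are hypothetical, and the patterns shown are only one possible choice, not the expected quiz answer.

```python
import re


def split_and_clean(text, pattern):
    """Split text on pattern, strip each piece, and drop empty pieces."""
    pieces = re.split(pattern, text)
    # Build a new list rather than removing items while iterating.
    return [p.strip() for p in pieces if p.strip()]


# Splitting on commas and semicolons leaves padded and empty pieces behind.
print(re.split(r"[,;]", " red , green;; blue ;"))
# [' red ', ' green', '', ' blue ', '']

print(split_and_clean(" red , green;; blue ;", r"[,;]"))
# ['red', 'green', 'blue']

# \W+ (runs of non-word characters) is one possible word delimiter.
print(split_and_clean("Hello, world! ", r"\W+"))
# ['Hello', 'world']
```

Note that re.split returns an empty string wherever two delimiters are adjacent or the text starts or ends with a delimiter, which is why the filtering step is needed.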
Start Quiz:
"""Splitting text data into tokens."""
import re


def sent_tokenize(text):
    """Split text into sentences."""
    # TODO: Split text by sentence delimiters (remove delimiters)
    # TODO: Remove leading and trailing spaces from each sentence
    pass  # TODO: Return a list of sentences (remove blank strings)


def word_tokenize(sent):
    """Split a sentence into words."""
    # TODO: Split sent by word delimiters (remove delimiters)
    # TODO: Remove leading and trailing spaces from each word
    pass  # TODO: Return a list of words (remove blank strings)


def test_run():
    """Called on Test Run."""
    text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"
    print("--- Sample text ---", text, sep="\n")

    sentences = sent_tokenize(text)
    print("\n--- Sentences ---")
    print(sentences)

    print("\n--- Words ---")
    for sent in sentences:
        print(sent)
        print(word_tokenize(sent))
        print()  # blank line for readability
User's Answer:
(Note: The user's answer is not guaranteed to be correct)
"""Splitting text data into tokens."""
import re


def sent_tokenize(text):
    """Split text into sentences."""
    # TODO: Split text by sentence delimiters (remove delimiters)
    lines = re.split(r'\s*[!?.]\s*', text)
    # TODO: Remove leading and trailing spaces from each sentence
    # Build a new list instead of calling lines.remove() inside the loop:
    # removing items from a list while iterating over it can skip elements.
    lines = [item.strip() for item in lines if item.strip()]
    return lines  # TODO: Return a list of sentences (remove blank strings)


def word_tokenize(sent):
    """Split a sentence into words."""
    # TODO: Split sent by word delimiters (remove delimiters)
    # TODO: Remove leading and trailing spaces from each word
    # str.split() with no arguments splits on runs of whitespace and never
    # yields empty strings, but it leaves punctuation attached to words.
    words = sent.split()
    return words  # TODO: Return a list of words (remove blank strings)


def test_run():
    """Called on Test Run."""
    text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"
    print("--- Sample text ---", text, sep="\n")

    sentences = sent_tokenize(text)
    print("\n--- Sentences ---")
    print(sentences)

    print("\n--- Words ---")
    for sent in sentences:
        print(sent)
        print(word_tokenize(sent))
        print()  # blank line for readability
INSTRUCTOR NOTE:
Note: The nltk package is not available within programming quizzes, but you can use it going forward for labs and projects.